from IPython.display import HTML
HTML('''<script>
code_show=true;
function code_toggle() {
if (code_show){
$('div.input').hide();
} else {
$('div.input').show();
}
code_show = !code_show
}
$( document ).ready(code_toggle);
</script>
<form action="javascript:code_toggle()"><input type="submit" value="Click here to toggle on/off the raw code."></form>''')
The Murder Accountability Project is a nonprofit organization that identifies discrepancies between homicides reported by medical examiners and those in the FBI's voluntary crime reports. The database is considered one of the most exhaustive collections of homicide records currently available in the United States. Additional information can be found on the Murder Accountability Project website.
This dataset was collected to investigate the failure of law enforcement agencies to report homicides. It covers 1980 to 2014 and includes demographic information such as gender, age, race, and ethnicity for both victims and perpetrators. A more in-depth description of the attributes may be found in the Data Description section.
With documentaries, cold case files, and similar media, the many homicide cases that go unsolved are placed under a spotlight for examination. According to the Murder Accountability Project, an estimated 5,000 murderers get away with murder each year, and nearly one third of reported homicides now go unsolved. While hundreds of thousands of Americans have been murdered, many cases are unaccounted for due to a lack of documentation for unsolved homicides. The dataset, like the organization itself, aims to educate the public on the significance of unsolved cases.
Both response variables are categorical. K-fold cross-validation (CV) will be used to evaluate the effectiveness of the classification algorithms, measured with ROC curves, accuracy, and sensitivity and specificity.
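As a rough sketch of that evaluation plan (synthetic data stands in here for the engineered features built later in the analysis), stratified k-fold CV with these metrics might look like:

```python
# Illustrative only: make_classification replaces the real homicide features.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

X, y = make_classification(n_samples=500, n_features=8, random_state=0)
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
clf = LogisticRegression(max_iter=1000)

# accuracy, ROC AUC, and sensitivity (positive-class recall), averaged over folds
acc = cross_val_score(clf, X, y, cv=cv, scoring='accuracy').mean()
auc = cross_val_score(clf, X, y, cv=cv, scoring='roc_auc').mean()
sens = cross_val_score(clf, X, y, cv=cv, scoring='recall').mean()
print(f'Accuracy: {acc:.3f}  ROC AUC: {auc:.3f}  Sensitivity: {sens:.3f}')
```

Stratified folds keep the class balance of solved vs. unsolved cases roughly constant across splits, which matters when the classes are uneven.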
The output below displays the initial data.
# IMPORT LIBRARIES
# hide warnings
import warnings
warnings.filterwarnings('ignore')
# all imported libraries used for analysis
import numpy as np
import pandas as pd
import os
import urllib
import copy
import plotly
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns
import plotly.graph_objects as go
from plotly.subplots import make_subplots
import plotly.express as px
import statsmodels.api as sm
import random
import us
from geopy.geocoders import Nominatim
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.decomposition import PCA
from sklearn.utils import resample
from sklearn.feature_selection import RFE
from sklearn.preprocessing import LabelEncoder
from sklearn.preprocessing import OrdinalEncoder
from sklearn.feature_selection import SelectKBest
from sklearn.feature_selection import chi2
from datetime import datetime
from sklearn.metrics import roc_auc_score
from sklearn.metrics import roc_curve
from sklearn.metrics import classification_report
from sklearn.metrics import plot_confusion_matrix
from sklearn.metrics import confusion_matrix
from pandas.plotting import scatter_matrix
# set color scheme and style for seaborn
sns.set(color_codes=True)
sns.set_style('whitegrid')
# Read the database.csv file and store in a dataframe
df=pd.read_csv('../Data/database.csv')
# take a peek at the dataframe to validate that it is populated
df.head()
The total number of records and attributes in the original dataset.
# print the number of records and attributes in the dataframe
records = len(df)
attributes = df.columns
print(f'Total Number of Records: {records} \nTotal Number of Attributes: {len(attributes)}')
The table below displays each attribute's name, description, and data type. In terms of unique classes, Agency Name and Agency Code are representations of the same data. During the data cleaning phase, specific columns are removed because they lack significance or restate other information.
While reviewing the attributes, note that Victim Count and Perpetrator Count record the additional persons involved in the crime. Each victim has their own record (in addition to being counted in the column), so the number of records equals the number of victims. Perpetrators, in contrast, do not have separate records, so the total number of perpetrators is the sum of Perpetrator Count plus the number of unique cases.
# summary of the variables, counts, null state, and data type
df.info()
df_description = pd.read_excel('../Data/data_description.xlsx')
pd.set_option('display.max_colwidth', 0)
df_types = pd.DataFrame(df.dtypes, columns=['Data Type']).reset_index().rename(columns={'index': 'Attributes'})
df_description = pd.merge(df_description, df_types, on='Attributes', how='inner')
df_description
To retrieve the total number of perpetrators, it is assumed that there is one incident per month, which identifies each unique case; Incident is not a running total of incidents for a record. Each case has at least one perpetrator, so grouping by incident, month, and perpetrator count and adding the perpetrator count to the number of unique records yields the total number of perpetrators.
# unique cases, identified by incident number, month, and perpetrator count
perpetrators_df = df.groupby(['Incident', 'Month', 'Perpetrator Count']).count().reset_index()[['Incident', 'Month', 'Perpetrator Count']]
# each unique case has at least one perpetrator
single_perpetrators = len(perpetrators_df)
# plus the additional perpetrators recorded on each case
additional_perpetrators = perpetrators_df['Perpetrator Count'].sum()
# additional_perpetrators = df['Perpetrator Count'].sum()
total_perpetrators = single_perpetrators + additional_perpetrators
print(f'Total number of perpetrators: {total_perpetrators}')
print(f'Total number of victims: {len(df)}')
# check for nulls and put into a dataframe
df_null = pd.DataFrame(df.isnull().sum(), columns=['null_count'])
# filter on null counts that are not 0
df_null.loc[df_null['null_count'] != 0]
While no nulls were present, the data contains ' ' as a value, which is replaced with 0 to align with the other unknown values. Since only one record matches this criterion, it is concluded to be a data entry error. 0 is used rather than 998 because 998 is never utilized in this column, while 0 already indicates missing data.
# locate the single record with a blank Perpetrator Age and set it to '0'
blank_index = df.loc[df['Perpetrator Age'] == ' '].index.values[0]
df.at[blank_index, 'Perpetrator Age'] = '0'
# confirm the 998 sentinel is not used for Perpetrator Age
df.loc[df['Perpetrator Age'] == 998]
Another anomaly discovered is a perpetrator age of 0, which would mean a child under the age of one committed the homicide. This has very little credibility. When filtering for records matching this scenario, the majority also list the rest of the perpetrator's demographics as unknown, indicating that the age is likely unknown rather than below one year. Notably, among these zero-age records, 601 are solved and 4,648 are unsolved; the difference can be attributed to this data not being collected or filed into the record.
# count solved vs. unsolved cases where the perpetrator's age is 0
crime_age_0 = pd.DataFrame(df.loc[df['Perpetrator Age'] == '0'].groupby('Crime Solved').count()['Record ID'].rename('Count'))
crime_age_0
# check whether any fully identical rows exist in the dataframe
def checkDuplicate(df):
    df_duplicates = df.groupby(df.columns.tolist(), as_index=False).size()
    duplicates = len(df_duplicates.loc[df_duplicates['size'] > 1])
    if duplicates == 0:
        print('No, duplicate instances are not present')
    else:
        print('Yes, duplicate instances are present')
    return duplicates

# check if duplicate instances are present
dup_result = checkDuplicate(df)
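As an alternative sketch, pandas' built-in `DataFrame.duplicated` performs the same check more directly, flagging rows that are exact copies of an earlier row (the toy frame below is illustrative only):

```python
import pandas as pd

# small example frame containing one exact duplicate row
sample = pd.DataFrame({'a': [1, 2, 2], 'b': ['x', 'y', 'y']})
n_dupes = int(sample.duplicated().sum())
if n_dupes:
    print('Yes, duplicate instances are present')
else:
    print('No, duplicate instances are not present')
```

`duplicated()` avoids building the intermediate grouped dataframe, which can matter on a dataset of this size.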
# create list of continuous variables
df['Perpetrator Age'] = df['Perpetrator Age'].astype(int)
continuous_att = np.delete(df.describe().columns.values, 0, 0) # remove record id
continuous_att = np.delete(continuous_att, 1, 0) # remove incident
# replace 0 with NaN
df_continuous = df[list(continuous_att)].copy()  # copy to avoid modifying a view of df
df_continuous['Victim Age'] = df_continuous['Victim Age'].replace(998, np.NaN)
df_continuous['Perpetrator Age'] = df_continuous['Perpetrator Age'].replace(0, np.NaN)
# function to create a boxplot for each continuous attribute
def create_boxplots(df, continuous):
    fig = make_subplots(rows=1, cols=len(continuous))
    for i in range(len(continuous)):
        fig.add_trace(go.Box(y=df[continuous[i]], name=continuous[i]), row=1, col=i+1)
    fig.update_layout(title='Boxplot Outlier Detection')
    fig.show()
# call function
create_boxplots(df_continuous,continuous_att)